Abstract

Background: When studying genetic diseases in which genetic variations are passed on to offspring, the ability to 
distinguish between paternal and maternal alleles is essential. Determining haplotypes from genotype data is called 
haplotype inference. Most existing computational algorithms for haplotype inference have been designed to use 
genotype data collected from individuals in the form of a pedigree. A haplotype is regarded as a hereditary unit 
and therefore input pedigrees are preferred that are free of mutational events and have a minimum number of 
genetic recombinational events. These ideas motivated the zero-recombinant haplotype configuration (ZRHC) 
problem, which strictly follows the Mendelian law of inheritance, namely that one haplotype of each child is 
inherited from the father and the other haplotype is inherited from the mother, both without any mutation. So far 
no linear-time algorithm for ZRHC has been proposed for general pedigrees, even though the number of mating 
loops in a human pedigree is usually very small and can be regarded as constant. 

Results: Given a pedigree with n individuals, m marker loci, and k mating loops, we propose an algorithm that
can provide a general solution to the zero-recombinant haplotype configuration problem in $O(kmn + k^2 m)$ time. In
addition, this algorithm can be modified to detect inconsistencies within the genotype data without loss of 
efficiency. The proposed algorithm was subject to 12000 experiments to verify its performance using different (n, 
m) combinations. The value of k was uniformly distributed between zero and six throughout all experiments. The 
experimental results show that execution time grows linearly with input size when both n and m
are larger than 100. For those experiments where n or m is less than 100, the proposed algorithm runs very fast,
in thousandths to hundredths of a second, on a personal desktop computer.

Conclusions: We have developed the first deterministic linear-time algorithm for the zero-recombinant haplotype 
configuration problem. Our experimental results demonstrated the linearity of its execution time in relation to the 
input size. The proposed algorithm can be modified to detect inconsistency within the genotype data without loss 
of efficiency and is expected to be able to handle recombinant and missing data with further extension. 
Background

A genetic disease is caused by an abnormality in an individual's
genome. Genetic diseases have been studied
extensively for decades by investigating the connection 
between diseases and genetic variations. In the human 
genome, chromosomes come in pairs; each gene consists 
of two alleles that reside in different chromosomes at the 
same locus. One of the two alleles comes from the father 
and the other comes from the mother. To study heredi- 
tary diseases in which the genetic variations are passed 
on to offspring, the ability to distinguish between pater- 
nal and maternal alleles is essential. Unfortunately, the 
haplotype structure of a human genome is not available 
directly from the genotyping and the unordered genotype 
data does not tell us which allele comes from which par- 
ent. A haplotype is a collection of alleles at multiple loci 
on a chromosome that tend to be inherited as a unit. The 
determination of haplotypes from genotype data is called 
haplotype phasing or haplotype inference. Algorithms for 
haplotype inference are indispensable and have been 
intensively studied. 

The existing computational algorithms for haplotype
inference can be classified as statistical or combinatorial,
and most of them were designed for genotype data
collected from individuals in the form of a pedigree. A
pedigree is a hierarchical structure that describes the par- 
ent-child relationship among members of a family. Indi- 
viduals without parents are called founders. There may 
be cycles in a pedigree, which are referred to as mating 
loops. A mating loop arises from a couple if they have 
children and both of them are offspring of certain family 
ancestors. An example of a pedigree, coupled with geno- 
type data, is depicted in Figure 1(a); each allele is denoted 
as 0 or 1 to represent its form within a gene. If two alleles 
of a gene are the same, the locus is homozygous; other- 
wise, it is heterozygous. A haplotype is regarded as a her- 
editary unit and therefore an input pedigree is preferred 
to be free of mutational events and to have a minimum
number of genetic recombination events [1]. Haplotype
inference under this assumption is referred to as the 
minimum-recombinant haplotype configuration (MRHC) 
problem, which requires the solving of the haplotype 
structure of the input pedigree with the minimum num- 
ber of recombination events [1]. Several algorithms have 
been proposed to solve the MRHC problem [1-8]. A special
case of MRHC is the zero-recombinant haplotype configuration
(ZRHC) problem, which strictly follows the
Mendelian law of inheritance, namely that one haplotype 
of each child is inherited from the father and the other 
haplotype is inherited from the mother, without any 
mutation [9]. To reduce the complexity of the ZRHC, 
some algorithms have been applied to pedigrees without 
mating loops (called tree pedigrees) [10-12]. In contrast 
to algorithms targeting tree pedigrees, so far no linear-time
algorithm for ZRHC has been proposed for general
pedigrees, even though the number of mating loops in a 
human pedigree is usually very small and can be regarded 
as constant; the execution time of existing algorithms for 
ZRHC using general pedigrees is polynomial [4,13-17]. 
Regardless of whether it is an MRHC or a ZRHC problem,
some algorithms have been extended to handle pedigrees 
with mutations or missing data [5,8,11,15]. In addition to 
haplotype inference from pedigree data, algorithms have 
been proposed for population datasets that come from 
unrelated individuals. Algorithms for population datasets 
try to decode the haplotype structure of each individual 
as well as the haplotype frequencies of a population 
[18-22]. All the above mentioned algorithms are mainly 
combinatorial. Readers who are interested in statistical 
approaches for haplotype inference can consult a recent 
review [23]. 

In this study, we target the ZRHC problem for
pedigree data. Assume that we are given a pedigree with n
individuals and m marker loci. For general pedigrees,
Li and Jiang proposed an $O(m^3 n^3)$ time algorithm by converting
the inheritance process into an equivalent linear
system of $O(mn)$ equations over the Galois field GF(2) and
invoking Gaussian elimination [4]. Xiao et al. improved
the method to take $O(mn^2 + n^3 \log^2 n \log\log n)$ time by
removing redundant equations from the linear system
[16]. Doan et al. proposed an $O(mn\,\alpha(m))$ time algorithm
by exploring constraints among marker loci rather than
family members, where $\alpha(\cdot)$ is the inverse of the Ackermann
function [14]. For tree pedigrees, the execution time
of the algorithm proposed by Xiao et al. can be reduced from
$O(mn^2 + n^3 \log^2 n \log\log n)$ to $O(mn + n^3)$ [16]. Li and Li
proposed an $O(mn\,\alpha(n))$ time algorithm using disjoint-set
data structures [11]. Liu et al. further lowered the complexity
of Xiao's algorithm to linear time $O(mn)$ [12].
Chan et al. also proposed a linear-time algorithm by maintaining
a graph structure [10]. Chan's algorithm, however,
only produces a particular solution. A particular solution
assigns a numerical value to each system variable, while a 
general solution describes all possible solutions of the sys- 
tem by designating certain variables as free variables and 
the others as linear combinations of these free variables. 

In this paper, we present an $O(kmn + k^2 m)$ time
algorithm that provides a general solution for ZRHC for
general pedigrees, where k is the number of mating
loops. In human pedigrees, k is usually very small and
can be regarded as constant. Our algorithm is therefore
linear in most practical cases. The
proposed algorithm was subjected to 12000 experiments to
verify its performance using different (n, m) combinations.
The value of k was uniformly distributed between
zero and six throughout all experiments. The experimental
results show that the execution time grows linearly in
relation to the input size when both n and m are larger
Figure 1 A pedigree of 11 members. (a) A pedigree of 11 members coupled with genotype data. The paternal haplotype of an individual is
listed left while its maternal haplotype is listed right, even though the haplotype information is not available from genotyping. For example, the
paternal and maternal haplotypes of individual $n_6$ are 0100 and 1110, respectively; the genotype of $n_6$, however, is specified as {0, 1}{1, 1}{0, 1}{0,
0}. Circles represent females and boxes represent males. Children are listed below their parents with line connections. For example, the couple
$n_7$ and $n_8$ have two children $n_9$ and $n_{10}$. There is a mating loop in the pedigree due to the common ancestor $n_2$ of the couple $n_5$ and $n_9$. (b) A
pedigree graph with a spanning tree. Tree edges are solid lines and non-tree edges are dotted lines. The genotype data are represented as
vectors of g-constants. There is a local cycle of length 4 due to the couple $n_7$ and $n_8$ and their children $n_9$ and $n_{10}$. There is a global cycle of
length 6 due to the mating loop. (c) There are four locus graphs for the different loci. Edges in locus forests are depicted as solid lines. Nodes
with thick borders are predetermined.



than 100. For those experiments where n or m is less
than 100, the proposed algorithm runs very fast, from
thousandths to hundredths of a second on a personal desktop
computer. We also show that the proposed algorithm
can be easily modified to detect inconsistencies
among genotype data without loss of efficiency.

Methods 

To apply computational techniques, we transformed the 
input pedigree into a pedigree graph by connecting each 
parent directly to its children (Figure 1(b)). A pedigree 
graph is an undirected graph G = (V, E), where V is a 
set of nodes and E a set of edges. Each node in V repre- 
sents an individual in the pedigree; each pair of nodes is 
connected with an edge in E if and only if the two indi- 
viduals have a parent-child relationship. G is defined to 
be undirected because the computational property of 
each edge is symmetric in our algorithm, even if the 
parent-child relationship is asymmetric. G may contain 
cycles. We only pay attention to two types of cycles: a 
cycle due to a mating loop, which is called a global cycle 
and a cycle due to a couple and two of their children, 
which is called a local cycle. Global cycles and local 
cycles are referred to as basic cycles. For ease of cycle 
processing, we construct a spanning tree T (G) on G. A 
basic cycle can be obtained by adding a non-tree edge
into $T(G)$. The set of non-tree edges is denoted by $E^x$.
Non-tree edges are further divided into two disjoint subsets
$E^x_l$ and $E^x_g$; members of $E^x_l$ induce local cycles and
members of $E^x_g$ induce global cycles. Mating loops seldom
appear in human pedigrees and therefore $|E^x_g| = k$
is regarded as a small constant.

In the rest of this paper, we assume that G has n
nodes and m loci, all loci are bi-allelic (alleles denoted by 0
or 1), and the input dataset is free of genotyping errors.
Under this assumption, the input size of ZRHC is
$O(mn)$. The genotype data of a node $n_i$ are represented as
a vector $g_{n_i}$ of size m. The genotype of $n_i$ at locus $l$,
where $1 \le l \le m$, is defined as follows:

$$
g_{n_i}[l] =
\begin{cases}
0 & \text{if locus } l \text{ is homozygous and both alleles are 0's} \\
1 & \text{if locus } l \text{ is homozygous and both alleles are 1's} \\
2 & \text{if locus } l \text{ is heterozygous}
\end{cases}
$$
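The encoding above can be illustrated with a minimal Python sketch. The function name `encode_genotype` and the list-based haplotype representation are assumptions made for this illustration only; in practice the two haplotypes are unknown and only the resulting g-vector is observed.

```python
def encode_genotype(paternal, maternal):
    """Return the g-vector of one individual from its two haplotypes.

    Hypothetical helper: each haplotype is a list of 0/1 alleles, one per locus.
    """
    if len(paternal) != len(maternal):
        raise ValueError("haplotypes must cover the same loci")
    g = []
    for a, b in zip(paternal, maternal):
        if a != b:
            g.append(2)   # heterozygous locus
        else:
            g.append(a)   # homozygous locus: 0 for {0, 0}, 1 for {1, 1}
    return g


# Individual n_6 of Figure 1(a): paternal haplotype 0100, maternal haplotype 1110.
g_n6 = encode_genotype([0, 1, 0, 0], [1, 1, 1, 0])
```

For $n_6$ this yields the vector (2, 1, 2, 0), which matches the unordered genotype {0, 1}{1, 1}{0, 1}{0, 0} given in the caption of Figure 1.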



Genotype data are available, thus all g-variables can be
regarded as constants (Figure 1(b)). We introduce a vector
$p_{n_i}$ of size m to describe the paternal haplotype of $n_i$.
The paternal allele of $n_i$ at locus $l$, where $1 \le l \le m$, is
defined as follows:

$$
p_{n_i}[l] =
\begin{cases}
0 & \text{if the paternal allele is 0} \\
1 & \text{if the paternal allele is 1}
\end{cases}
$$

The vector $p_{n_i}$ is regarded as unknown even though
we know that $p_{n_i}[l] = g_{n_i}[l]$ if $n_i$ is homozygous at locus
$l$ (i.e. $g_{n_i}[l] \ne 2$).

We formulated the ZRHC problem as follows. 

ZRHC Given a pedigree graph $G = (V, E)$ with full
g-constants, determine $p_{n_i}$ of each node $n_i$ in V.

The haplotype configuration of the input pedigree is 
identified by specifying the paternal haplotype of each 
family member. 

A system of linear equations over GF(2) 

In this section, we introduce a system of linear equations
based on G and g-constants; this system was first
proposed in [16] and will be reduced to determine all
p-variables. Since p-variables carry binary values, all
equations in the linear system are defined over GF(2),
whose operations addition (+) and multiplication (·) are
shown in Table 1.

The building block of the system: inheritance 

"Inheritance" is the building block of the system. What 
parents pass to their children must be the same as what
children receive from their parents. For a parent $n_i$ and
a locus $l$, $n_i$ passes $p_{n_i}[l] + 1$ to its child if and only
if the genotype of $n_i$ at locus $l$ is heterozygous and $n_i$
passes its maternal allele; otherwise $n_i$ passes $p_{n_i}[l]$ to
its child. We introduce two auxiliary variables
$w_{n_i}[l]$ and $h_{n_i,n_j}$ to formally state the above argument.
The variable $w_{n_i}[l]$ indicates whether locus $l$ of $n_i$ is
heterozygous.

$$
w_{n_i}[l] =
\begin{cases}
0 & \text{if } g_{n_i}[l] \ne 2 \text{ (i.e. homozygous at locus } l\text{)} \\
1 & \text{if } g_{n_i}[l] = 2 \text{ (i.e. heterozygous at locus } l\text{)}
\end{cases}
$$

The variable $h_{n_i,n_j}$ indicates which allele of $n_i$ is passed
to its child $n_j$.

$$
h_{n_i,n_j} =
\begin{cases}
0 & \text{if } n_i \text{ passes its paternal allele to } n_j \\
1 & \text{if } n_i \text{ passes its maternal allele to } n_j
\end{cases}
$$

Therefore, $p_{n_i}[l] + w_{n_i}[l] \cdot h_{n_i,n_j}$ represents the allele at
locus $l$ that $n_i$ passes to $n_j$.

Table 1 Addition (+) and multiplication (·) in GF(2)

  +   0   1        ·   0   1
  0   0   1        0   0   0
  1   1   0        1   0   1

On the other hand, assume that $n_j$ receives an allele
from $n_i$. If $n_i$ is $n_j$'s father, what $n_i$ passes to $n_j$ is the
paternal allele of $n_j$. In this case, we have
$p_{n_i}[l] + w_{n_i}[l] \cdot h_{n_i,n_j} = p_{n_j}[l]$. If $n_i$ is $n_j$'s mother, there are
two sub-cases. If locus $l$ of $n_j$ is homozygous, what $n_i$
passes to $n_j$ must be the same as the paternal allele of
$n_j$. In this case, we have $p_{n_i}[l] + w_{n_i}[l] \cdot h_{n_i,n_j} = p_{n_j}[l]$.
If locus $l$ of $n_j$ is heterozygous, what $n_i$ passes to $n_j$
is the maternal allele of $n_j$ and is different from the
paternal allele of $n_j$. In this case, we have
$p_{n_i}[l] + w_{n_i}[l] \cdot h_{n_i,n_j} = p_{n_j}[l] + 1$. Because the variable $w_{n_j}[l]$
indicates whether locus $l$ of $n_j$ is homozygous or
heterozygous, the two sub-cases can be combined
into a single equation $p_{n_i}[l] + w_{n_i}[l] \cdot h_{n_i,n_j} = p_{n_j}[l] + w_{n_j}[l]$.
Moreover, if we introduce another auxiliary variable
$d_{n_i,n_j}[l]$ as follows,

$$
d_{n_i,n_j}[l] =
\begin{cases}
0 & \text{if } n_i \text{ is } n_j\text{'s father} \\
w_{n_j}[l] & \text{if } n_i \text{ is } n_j\text{'s mother,}
\end{cases}
$$

the inheritance relationship can be unified into the
following equation:

$$
p_{n_i}[l] + w_{n_i}[l] \cdot h_{n_i,n_j} = p_{n_j}[l] + d_{n_i,n_j}[l]. \tag{1}
$$
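As an illustration of the unified inheritance equation, the following Python sketch checks it locus by locus for one parent-child edge. Addition over GF(2) is written as XOR (`^`) and multiplication as AND (`&`). The function names, the 0-based locus index, and the example family are assumptions chosen for this sketch and do not come from the paper.

```python
def w(g, l):
    """w-constant: 1 if the genotype vector g is heterozygous at locus l."""
    return 1 if g[l] == 2 else 0


def d(parent_is_mother, g_child, l):
    """d-constant of an edge: 0 for a father, w of the child for a mother."""
    return w(g_child, l) if parent_is_mother else 0


def inheritance_holds(p_parent, g_parent, p_child, g_child, h, parent_is_mother):
    """Check p_parent[l] + w_parent[l]*h == p_child[l] + d[l] over GF(2) at every locus."""
    for l in range(len(g_parent)):
        lhs = p_parent[l] ^ (w(g_parent, l) & h)
        rhs = p_child[l] ^ d(parent_is_mother, g_child, l)
        if lhs != rhs:
            return False
    return True


# Hypothetical family: the father (haplotypes 0100 | 1110) passes his maternal
# haplotype (h = 1); the mother (0110 | 0010) passes her paternal haplotype (h = 0).
p_father, g_father = [0, 1, 0, 0], [2, 1, 2, 0]
p_mother, g_mother = [0, 1, 1, 0], [0, 2, 1, 0]
p_child, g_child = [1, 1, 1, 0], [2, 1, 1, 0]   # child haplotypes 1110 | 0110

father_ok = inheritance_holds(p_father, g_father, p_child, g_child, 1, False)
mother_ok = inheritance_holds(p_mother, g_mother, p_child, g_child, 0, True)
mother_wrong_h = inheritance_holds(p_mother, g_mother, p_child, g_child, 1, True)
```

Both edges satisfy the equation with the stated h-values, while flipping the mother's h-value breaks it at the locus where she is heterozygous.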

Note that the w- and d-variables are constant by definition,
and the p- and h-variables are unknowns. Equation
(1) formulates the property of edge $(n_i, n_j)$ in G: p-variables
and w-constants are attributes of the nodes $n_i$ and $n_j$, and
h-variables and d-constants describe the inheritance relation
associated with the edge $(n_i, n_j)$. With the information
provided by Equation (1), various constraints on h-variables
can be generated by traversing different paths in G.
Our algorithm first determines the h-variables
based on these constraints; the solution to the
ZRHC problem is then obtained by determining all p-variables
from the solved h-values and Equation (1). One
point needs special care: if $n_j$ is a child of $n_i$, $h_{n_j,n_i}$ and
$d_{n_j,n_i}$ are undefined. In our algorithm, we make the h-variables
and d-constants symmetric such that $h_{n_j,n_i} = h_{n_i,n_j}$
and $d_{n_j,n_i} = d_{n_i,n_j}$.

Linear constraints on h-variables 

To reduce the computational complexity of our algorithm,
we try to make the number of unknowns in the
coming linear system as small as possible. In the pedigree
graph G, we have mn p-variables and at most 2n h-variables
(since each individual has two parents and there are
at most n individuals). Observe that if a node $n_i$ itself or
one of its parents is homozygous at locus $l$, $p_{n_i}[l]$ is
determined by definition and Equation (1). In this case $n_i$
is referred to as predetermined at locus $l$ and the number
of unknown p-variables is reduced by one. Moreover, for
an edge $(n_i, n_j) \in E$, where $n_i$ is a parent of $n_j$, $h_{n_i,n_j}$ is
cancelled from Equation (1) if $w_{n_i}[l] = 0$ at locus $l$. If
$w_{n_i}[l] = 0$ holds for all $1 \le l \le m$, no constraints are
imposed on $h_{n_i,n_j}$ and it becomes a free variable (or its
value will finally depend on other free variables). In this
case the number of h-variables to be determined is
reduced by one, which is equivalent to the removal of
edge $(n_i, n_j)$ from G. Accordingly, w-constants can be
viewed as the weights of edges in G; we only pay attention
to edges with weight one (edges whose parent node is heterozygous).
To consider only the edges with weight one at
locus $l$, we construct the $l$-th locus graph $G_l = (V, E_l)$,
where $E_l = \{(n_i, n_j) \mid n_i \text{ is a parent of } n_j,\ w_{n_i}[l] = 1\}$.
Moreover, the spanning forest $T(G) \cap G_l$ is denoted by
$T(G_l)$ and is referred to as the $l$-th locus forest (Figure 1(c)).
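The definitions of the locus graph and of predetermined nodes can be sketched as follows. This is a minimal illustration under stated assumptions: the `parents` dictionary (child mapped to a father-mother pair), the node names, and the 0-based locus index are all hypothetical, and the sketch returns directed (parent, child) pairs for readability even though the pedigree graph itself is undirected.

```python
def locus_graph(parents, g, l):
    """Edges (parent, child) of the l-th locus graph: the parent is heterozygous at l."""
    return {(par, child)
            for child, pair in parents.items()
            for par in pair
            if g[par][l] == 2}


def predetermined(parents, g, l):
    """Nodes that are homozygous at locus l or have a parent that is homozygous at l."""
    nodes = set()
    for n in g:
        own_homozygous = g[n][l] != 2
        parent_homozygous = any(g[par][l] != 2 for par in parents.get(n, ()))
        if own_homozygous or parent_homozygous:
            nodes.add(n)
    return nodes


# Hypothetical trio with two loci (indexed 0 and 1 in this sketch).
parents = {"child": ("father", "mother")}
g = {"father": [2, 0], "mother": [2, 2], "child": [2, 2]}

edges_locus0 = locus_graph(parents, g, 0)
edges_locus1 = locus_graph(parents, g, 1)
fixed_locus0 = predetermined(parents, g, 0)
fixed_locus1 = predetermined(parents, g, 1)
```

At the first locus everyone is heterozygous, so both edges are kept and no node is predetermined; at the second locus the homozygous father drops his edge and makes both himself and the child predetermined.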

We define constraints on h-variables by traversing
paths in the locus graphs. Consider a path $p = n_0, n_1, \ldots, n_i$
in $G_l$. Assume that $n_0$ and $n_i$ are predetermined and
all other in-between nodes are non-predetermined. Every
edge of $G_l$ has a parent node with w-constant one, so
Equation (1) expresses each h-variable on the path as the
sum of the two endpoint p-values and a d-constant. Adding
up all h-variables on the path cancels the p-values of the
in-between nodes in pairs and produces the following equation:

$$
\sum_{j=0}^{i-1} h_{n_j,n_{j+1}} = p_{n_0}[l] + p_{n_i}[l] + \sum_{j=0}^{i-1} d_{n_j,n_{j+1}}[l] = b. \tag{2}
$$

Since $n_0$ and $n_i$ are predetermined and all d-constants
are known, b is a constant. The constant b is said to be
the constraint of path p. Note that the constraint b does
not depend on the direction in which path p is read because
the h-variables and d-constants are symmetric. Moreover,
if the path is a cycle $c = n_0, n_1, \ldots, n_i, n_0$ in $G_l$, we
would have the following equation:

$$
\sum_{j=0}^{i} h_{n_j,\,n_{(j+1) \bmod (i+1)}} = \sum_{j=0}^{i} d_{n_j,\,n_{(j+1) \bmod (i+1)}}[l] = b'. \tag{3}
$$

Again, since all d-constants are known, $b'$ is also a
constant. The constant $b'$ is said to be the constraint of
cycle c. On the basis of Equations (2) and (3), we can
generate constraint equations with only h-variables for
cycles or for paths that connect predetermined nodes in
$G_l$. Constraints can be classified into two categories with
respect to the spanning tree $T(G)$: cycle and path constraints
derived from paths containing non-tree edges,
and tree constraints derived from paths containing only
tree edges.
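The right-hand sides of Equations (2) and (3) only involve known quantities, which the following Python sketch makes concrete. The function names and the numerical example are assumptions for illustration; XOR plays the role of addition over GF(2).

```python
from functools import reduce
from operator import xor


def path_constraint(p_start, p_end, d_on_edges):
    """Constant b of Equation (2): endpoint p-values plus all d-constants on the path."""
    return reduce(xor, d_on_edges, p_start ^ p_end)


def cycle_constraint(d_on_edges):
    """Constant b' of Equation (3): the sum of all d-constants on the cycle."""
    return reduce(xor, d_on_edges, 0)


# Hypothetical path child_0 - mother - child_1 at one locus, where the mother is
# heterozygous and both children are predetermined:
#   child_0: paternal allele 1, heterozygous, so d = w = 1
#   child_1: paternal allele 0, homozygous,   so d = w = 0
b = path_constraint(1, 0, [1, 0])
```

Here b = 0, meaning the two h-variables on the path sum to zero: the mother passes the same haplotype to both children. This agrees with a direct reading of the example, since both children receive allele 0 from her.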

Cycle and Path constraints 

Adding a non-tree edge e into the spanning tree $T(G)$
generates a basic cycle c. If $G_l$ contains e, there are two
cases of c in $G_l$.

Case 1 c is in $G_l$. A cycle constraint $b_c$ of cycle c can
be obtained by Equation (3). The constraint is denoted
interchangeably by $b_c$ or $(b_c, e)$, which is also said to be
the cycle constraint of e.

Case 2 c is broken into several disjoint paths in $G_l$ by
predetermined nodes. Since these paths are disjoint,
exactly one of them, say $p'$, contains e. Along
the path $p'$, we identify a subpath $p = n_i, \ldots, n_j$ containing
e such that $n_i$ and $n_j$ are predetermined and all other
in-between nodes are non-predetermined. A path
constraint $b_p$ of the subpath p can be obtained by Equation (2).
The constraint is denoted interchangeably by $b_p$
or $(n_i, n_j, b_p, e)$, which is also said to be the path constraint
of e. Path constraints are symmetric because
$(n_i, n_j, b_p, e) = (n_j, n_i, b_p, e)$.

Tree constraints 

For each connected component of $T(G_l)$, we arbitrarily
pick a predetermined node $n_s$ as the seed. For the unique
tree path p that connects $n_s$ and another predetermined
node $n_k$ in the same connected component, a tree constraint
$b_t$ of path p can be obtained by Equation (2). The
constraint is denoted interchangeably by $b_t$ or $(n_s, n_k, b_t)$.
Tree constraints are symmetric because $(n_s, n_k, b_t) = (n_k, n_s, b_t)$.
Note that if there exists a component that has no
predetermined nodes, locus $l$ must be heterozygous
across the entire pedigree and no tree constraints will be
generated.

Our algorithm in relation to the ZRHC problem 

Our algorithm consists of four steps. We begin by initializing
the required data structures in the preprocessing step.
The initialized data structures are passed to the constraint
generation step to construct a system of linear
constraints on h-variables. Two issues should
be addressed. First, since all constraints are derived from
locus graphs that come from the same pedigree graph,
there is usually redundancy in the system. Second, we
do not actually need to know all h-values to solve the
ZRHC problem. For a child node $n_i$ there are two h-variables
related to it and its parents. However, from Equation (1)
we know that one of the two h-values is
sufficient to determine $p_{n_i}$. So it is easy to see that the
$(n - 1)$ h-variables in $T(G)$ form a minimal sufficient set to
solve the ZRHC problem. In the third step, constraint
reduction and transformation, we therefore try to eliminate
redundancy in the system and transform as many
path constraints into tree constraints as possible. Finally,
in the haplotype determination step, we introduce an efficient
way to solve the h-variables and then the p-variables
based on the reduced system.

Step 1: preprocessing 

The data structures of our algorithm are initialized by
the following procedures:

1. Transform the pedigree into a pedigree graph $G = (V, E)$.
Each node $n_i$ in V is equipped with its genotype
vector $g_{n_i}$. Since each individual has two parents,
there are at most 2n edges in G, so we have
$|V| = O(n)$ and $|E| = O(n)$.

2. Construct a spanning tree $T(G)$ on G.

3. For each locus $l$,

(a) generate a locus graph $G_l$,

(b) generate a locus forest $T(G_l)$, and

(c) identify predetermined nodes as well as their
p-values, all d-constants, and all w-constants.

The operations applied in this step are graph traversal
and spanning tree construction; both operations can be
performed in time $O(|V| + |E|) = O(n)$. Since they are
repeated for each of the m loci, the time
complexity of this step is $O(mn)$.

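The first two preprocessing procedures can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the `parents` dictionary, the node names, and the use of breadth-first search with sorted neighbor order (chosen only to make the result deterministic) are assumptions.

```python
from collections import deque


def pedigree_graph(parents):
    """Undirected adjacency sets linking every child to each of its parents."""
    adj = {}
    for child, (father, mother) in parents.items():
        for parent in (father, mother):
            adj.setdefault(parent, set()).add(child)
            adj.setdefault(child, set()).add(parent)
    return adj


def spanning_forest(adj):
    """Return (tree_edges, non_tree_edges) found by breadth-first search."""
    tree, seen = set(), set()
    for root in sorted(adj):
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in sorted(adj[u]):
                if v not in seen:
                    seen.add(v)
                    tree.add(frozenset((u, v)))
                    queue.append(v)
    all_edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    return tree, all_edges - tree


# Hypothetical nuclear family: a couple (a, b) with two children (c, d).
parents = {"c": ("a", "b"), "d": ("a", "b")}
adj = pedigree_graph(parents)
tree_edges, non_tree_edges = spanning_forest(adj)
```

The four nodes are joined by three tree edges, and the single remaining non-tree edge closes the local cycle formed by the couple and their two children. Each node and edge is handled a constant number of times, in line with the $O(|V| + |E|)$ bound stated above.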
Step 2: constraint generation 

A system of linear equations on h-variables over GF(2)
will be constructed in this step. The system consists of
three sets $C_C$, $C_P$, and $C_T$ that contain different kinds of
constraints. $C_C$ contains cycle constraints, if any, of all
non-tree edges at all loci. Similarly, $C_P$ contains path constraints,
if any, of all non-tree edges at all loci. Finally, $C_T$
contains tree constraints at all loci. To reduce computational
complexity, repetitions of set members are forbidden
in our algorithm; we do nothing if an existing
member is going to be added into the same set.

There are $O(mn)$ trials to generate a constraint for a
non-tree edge since there are m locus graphs, each of
which contains $O(n)$ non-tree edges; in each trial we perform
a cycle detection procedure to generate a cycle constraint
or a path constraint, so we have $|C_C| + |C_P| = O(mn)$.
The cycle detection procedure is usually implemented
by depth-first graph traversal and its execution
time depends on the length of the cycle. Consequently, if
a non-tree edge induces a global cycle, the cycle detection
procedure takes $O(n)$ time; otherwise the procedure
takes constant time because each local cycle contains
only four edges. The time to generate $O(mn)$ cycle and
path constraints is $O(kmn)$ since there are at most km
trials to generate global cycle constraints. To generate
tree constraints within a locus graph, we perform tree
traversal on its locus forest. This procedure generates
$O(n)$ tree constraints in $O(n)$ time. So we require $O(mn)$
time to generate tree constraints at all loci. The time
complexity to generate our constraint system is therefore
$O(kmn) + O(mn) = O(kmn)$.

Step 3: constraint reduction and transformation 
Redundancy arises in the constraint system if a constraint
can be represented as a linear combination of
other constraints. We are especially interested in the following
two types of redundancy.

Type 1 Assume there is a basic cycle c in G and it can
be decomposed into two edge-disjoint paths $p_1$ and $p_2$,
both connecting nodes $n_i$ and $n_j$. There must be exactly one
non-tree edge e in c, and without loss of generality, we
assume that e belongs to path $p_1$. If there is a cycle constraint
$(b_c, e)$ of c, a path constraint $(n_i, n_j, b_p, e)$ of $p_1$,
and a tree constraint $(n_i, n_j, b_t)$ of $p_2$, we have $b_c = b_p + b_t$
by Equations (2) and (3). That is, these three constraints
are linearly dependent and each of them can be represented
as a linear combination of the other two constraints
(Figure 2(a)). A path constraint can therefore be
transformed into a tree constraint by the equation $b_t = b_p + b_c$,
which is the basis of the reduction of our constraint
system.

Type 2 Assume there are three tree constraints $(n_i, n_j, b_1)$,
$(n_i, n_k, b_2)$, and $(n_j, n_k, b_3)$ of paths $p_1$, $p_2$, and $p_3$,
respectively. By definition we know that a tree constraint
is the summation of all h-variables along a
unique path in $T(G)$, so we have

$$
b_1 = \sum_{(n_x, n_y) \in p_1} h_{n_x,n_y}, \qquad
b_2 = \sum_{(n_x, n_y) \in p_2} h_{n_x,n_y}, \qquad
b_3 = \sum_{(n_x, n_y) \in p_3} h_{n_x,n_y}.
$$

Suppose that $n_l$ is the node closest to $n_i$ on the path
$p_3$. We then have three paths $p_4$ between $n_i$ and $n_l$, $p_5$
between $n_l$ and $n_j$, and $p_6$ between $n_l$ and $n_k$ such that
$p_1 = p_4 + p_5$, $p_2 = p_4 + p_6$, and $p_3 = p_5 + p_6$. The tree
constraints can therefore be rewritten as

$$
\begin{aligned}
b_1 &= \sum_{(n_x, n_y) \in p_4} h_{n_x,n_y} + \sum_{(n_x, n_y) \in p_5} h_{n_x,n_y}, \\
b_2 &= \sum_{(n_x, n_y) \in p_4} h_{n_x,n_y} + \sum_{(n_x, n_y) \in p_6} h_{n_x,n_y}, \\
b_3 &= \sum_{(n_x, n_y) \in p_5} h_{n_x,n_y} + \sum_{(n_x, n_y) \in p_6} h_{n_x,n_y}.
\end{aligned}
$$

Because all constraints are defined over GF(2), we
conclude that $b_1 + b_2 = b_3$; the three tree constraints
are linearly dependent and each of them can be represented
as a linear combination of the other two constraints
(Figure 2(b)). The above argument implies the
following lemma.

Lemma 1 For any three nodes $n_i$, $n_j$, and $n_k$, the tree
constraint of the path between $n_j$ and $n_k$ is equal to the
total tree constraint of the path between $n_i$ and $n_j$ and
the path between $n_i$ and $n_k$.

Lemma 1 still holds even if $n_i$ is on the path between
$n_j$ and $n_k$ ($n_i = n_l$ in Figure 2(b)), which means that if a
tree path is partitioned into two disjoint sub-paths, the
tree constraint of this path is equal to the total constraint
of the two sub-paths.
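Lemma 1 can be checked numerically on a small tree. The sketch below assumes that the h-values of all tree edges are known, which is not the case while the algorithm runs; it only illustrates why the shared part of two tree paths cancels over GF(2). The rooted-tree representation, the node names, and the h-values are hypothetical.

```python
def prefix_xor(root, children, h):
    """x[v] = XOR of the h-values on the tree path from root to v."""
    x = {root: 0}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            x[v] = x[u] ^ h[(u, v)]
            stack.append(v)
    return x


def tree_constraint(x, a, b):
    """XOR of the h-values on the tree path a..b; the common prefix cancels."""
    return x[a] ^ x[b]


# Hypothetical tree: r has children i and l; l has children j and k.
children = {"r": ["i", "l"], "l": ["j", "k"]}
h = {("r", "i"): 1, ("r", "l"): 0, ("l", "j"): 1, ("l", "k"): 0}
x = prefix_xor("r", children, h)

b1 = tree_constraint(x, "i", "j")   # path between n_i and n_j
b2 = tree_constraint(x, "i", "k")   # path between n_i and n_k
b3 = tree_constraint(x, "j", "k")   # path between n_j and n_k
```

With these values, b1 = 0, b2 = 1, and b3 = 1, so b1 + b2 = b3 over GF(2), as the lemma states.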

In this step, we remove the type 1 redundancy by
transforming as many path constraints to tree constraints
as possible, and remove the type 2 redundancy
by reducing $C_T$ to an equivalent set whose cardinality is
at most $(n - 1)$.

For each non-tree edge $e \in E^x$, if cycle constraint $(b_c, e)$
exists, we remove all path constraints $(n_i, n_j, b_p, e)$, if
any, from $C_P$ and add tree constraints $(n_i, n_j, b_c + b_p)$
into $C_T$. Since the size of $C_P$ is $O(mn)$, this procedure
can be carried out in time $O(mn)$, and the new $C_T$ is of
size $O(mn)$.
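The transformation of path constraints into tree constraints can be sketched as below. The data layout is an assumption made for this illustration: cycle constraints are stored in a dictionary keyed by non-tree edge, path constraints are 4-tuples, tree constraints are 3-tuples with sorted endpoints (to respect their symmetry), and all node and edge names are hypothetical.

```python
def transform_path_constraints(cycle_constraints, path_constraints, tree_constraints):
    """Turn (ni, nj, bp, e) into the tree constraint (ni, nj, bc + bp) when (bc, e) exists.

    Returns the path constraints that could not be transformed and the enlarged
    set of tree constraints. Addition over GF(2) is XOR.
    """
    remaining = set()
    tree = set(tree_constraints)
    for ni, nj, bp, e in path_constraints:
        if e in cycle_constraints:
            bc = cycle_constraints[e]
            a, b = sorted((ni, nj))        # constraints are symmetric in ni, nj
            tree.add((a, b, bc ^ bp))      # sets silently ignore repeated members
        else:
            remaining.add((ni, nj, bp, e))
    return remaining, tree


# Hypothetical input: non-tree edge ("u", "v") has a cycle constraint, ("s", "t") has none.
cycle_constraints = {("u", "v"): 1}
path_constraints = [("q", "p", 0, ("u", "v")), ("x", "y", 1, ("s", "t"))]
remaining, tree = transform_path_constraints(cycle_constraints, path_constraints, set())
```

Each path constraint is inspected once with constant-time dictionary and set operations, which is consistent with the linear bound in the size of $C_P$ stated above.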
Figure 2 Two types of redundancy arise from linear dependence. (a) A cycle c is decomposed into a tree path $p_2$ and a path $p_1$ that contains
a non-tree edge e. So we have $b_c = b_p + b_t$, where $b_c$, $b_p$, and $b_t$ are constraints of cycle c, path $p_1$, and path $p_2$, respectively. The dotted line
represents the non-tree edge e. (b) $n_l$ is the node closest to $n_i$ on the path $p_3$. Assume that the constraint of tree path $p_i$ is $b_i$, $1 \le i \le 6$. We have
$b_1 = b_4 + b_5$, $b_2 = b_4 + b_6$, and $b_3 = b_5 + b_6$, which imply that $b_1 = b_2 + b_3$ due to the addition over GF(2).



To further remove the redundancy in $C_T$, we construct
a constraint graph $G^*$ of G. The constraint graph $G^*$
shares the same set of nodes V as G; for each tree constraint
$(n_i, n_j, b_t) \in C_T$, we introduce an edge connecting
nodes $n_i$ and $n_j$ in $G^*$ with weight $b_t$ (Figure 3(a)). An
example of a constraint graph is depicted in Figure 3(b).
The constraint graph is used to reduce the size of $C_T$. As
shown in Figure 3(b), a constraint graph may not be connected.
Within each connected component in $G^*$, we
randomly choose a seed $n_s$ and try to assign each node $n_i$
a variable $W[n_i]$ to represent the tree constraint of the
tree path between nodes $n_s$ and $n_i$ in the pedigree graph
G. The assignment is carried out by the following steps
in each connected component of $G^*$.

1. $W[n_s]$ of the seed $n_s$ is assigned the value zero,

2. starting from $n_s$, perform a breadth-first-search traversal
via tree constraints, i.e., we can traverse from node
$n_i$ to node $n_j$ if $(n_i, n_j, b_t) \in C_T$ or $(n_j, n_i, b_t) \in C_T$,

3. as we traverse from $n_i$ to $n_j$ through $(n_i, n_j, b_t)$ or
$(n_j, n_i, b_t)$, if $n_j$ is unvisited, we assign $W[n_i] + b_t$ to
$W[n_j]$ based on Lemma 1; otherwise we do nothing.
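The three assignment steps above can be sketched as a breadth-first search over the constraint graph. In this sketch the seed of each component is simply the first unvisited node in the given order rather than a random one, and the function name, the `component` map, and the example constraints are assumptions for illustration.

```python
from collections import deque


def assign_w(nodes, tree_constraints):
    """Assign W-values by BFS over the constraint graph G*.

    tree_constraints holds triples (ni, nj, bt). Returns W and a map from each
    node to the seed of its connected component.
    """
    adj = {n: [] for n in nodes}
    for ni, nj, bt in tree_constraints:
        adj[ni].append((nj, bt))
        adj[nj].append((ni, bt))

    W, component = {}, {}
    for seed in nodes:
        if seed in W:
            continue
        W[seed] = 0                       # step 1: the seed gets value zero
        component[seed] = seed
        queue = deque([seed])
        while queue:                      # step 2: BFS via tree constraints
            u = queue.popleft()
            for v, bt in adj[u]:
                if v not in W:            # step 3: assign only unvisited nodes
                    W[v] = W[u] ^ bt      # addition over GF(2)
                    component[v] = seed
                    queue.append(v)
    return W, component


# Hypothetical constraint graph with one isolated node.
nodes = ["n1", "n2", "n3", "n4"]
constraints = [("n1", "n2", 1), ("n3", "n2", 1)]
W, component = assign_w(nodes, constraints)
```

Every node and every tree constraint is processed a constant number of times, so the sketch runs in time linear in the size of $C_T$ plus the number of nodes.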



Since $W[n_i]$ represents the tree constraint $(n_s, n_i, W[n_i])$,
it can be regarded as the summation of h-variables
along the unique path on $T(G)$ from the seed $n_s$ to node
$n_i$, which implies the following lemma:

Lemma 2 The h-value of a tree edge $(n_i, n_j)$ in $T(G)$
can be obtained by $h_{n_i,n_j} = W[n_i] + W[n_j]$ if $n_i$ and $n_j$
reside in the same connected component of $G^*$.

Therefore, if we can assign W-values to all nodes in V
and make $G^*$ connected, $G^*$ would be equivalent to a
reduced $C_T$ of size $(n - 1)$ that covers h-variables of all
tree edges of $T(G)$ and is sufficient to solve the ZRHC
problem. The construction of the constraint graph takes
$O(|C_T|) = O(mn)$ time.
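Lemma 2 translates directly into a lookup over the tree edges. The sketch below assumes that W-values and a component label per node are already available; returning `None` for an edge whose endpoints lie in different components is a convention chosen here to mark an h-variable that the current constraints do not determine.

```python
def tree_edge_h(tree_edges, W, component):
    """h-value of each tree edge by Lemma 2, or None if the endpoints are in different components."""
    h = {}
    for ni, nj in tree_edges:
        if component[ni] == component[nj]:
            h[(ni, nj)] = W[ni] ^ W[nj]   # addition over GF(2)
        else:
            h[(ni, nj)] = None            # not determined by the current constraints
    return h


# Hypothetical W-values and component seeds for four nodes.
W = {"n1": 0, "n2": 1, "n3": 0, "n4": 0}
component = {"n1": "n1", "n2": "n1", "n3": "n1", "n4": "n4"}
tree_edges = [("n1", "n2"), ("n2", "n3"), ("n3", "n4")]
h = tree_edge_h(tree_edges, W, component)
```

The undetermined edge in this example is the kind of case that motivates connecting as many components of $G^*$ as possible.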

The constraint graph $G^*$, however, may not be connected
with fully assigned W-values. We therefore introduce
an extension procedure to extend $G^*$ by adding
extra tree constraints, if any, into $G^*$; we would like to
reduce the number of connected components in $G^*$ as
much as possible. To find more tree constraints to be
added into $G^*$, we examine those non-tree edges $e \in E^x$
that do not have cycle constraints in $C_C$. The basic idea is
that if we can synthesize a new cycle whose constraint is


Figure 3 The concept of a constraint graph. (a) A tree constraint $(n_i, n_j, b_t)$ of the path that connects $n_i$ and $n_j$ in a pedigree graph G will be
transformed into an edge between $n_i$ and $n_j$ with weight $b_t$ in the corresponding constraint graph $G^*$. (b) A constraint graph. There are three
edges $(n_3, n_{11})$, $(n_7, n_5)$, and $(n_9, n_5)$ in the constraint graph, which means that there are three tree constraints in the linear system. Note that the
constraint graph is disconnected and contains several connected components.
the same as the expected cycle constraint of e, we may
obtain new tree constraints by transforming known path
constraints of e.

For a non-tree edge e without cycle constraint, we try
to synthesize a cycle only if $e \in E^x_g$. We do nothing if
$e \in E^x_l$ because no extra tree constraints of e can be
obtained by cycle synthesis. To see this, suppose the local
cycle induced by e connects a couple $n_a$ and $n_b$ and their
two children $n_c$ and $n_d$; without loss of generality, we
assume $e = (n_a, n_d)$ (Figure 4). We can examine the possible
constraints derived from this local cycle. Constraints
of a single edge with predetermined endpoints are not of
interest and can be ignored because the p-values of the
endpoints are known; we need only pay attention to constraints
whose path lengths are longer than one. In the
$l$-th locus graph, if $w_{n_a}[l] = w_{n_b}[l] = 1$, a local cycle exists
and we have cycle constraint $(b_c, e)$ (Figure 4(a)); if
$w_{n_a}[l] = 1$ and $w_{n_b}[l] = 0$, we only have the path $p_1 = n_c n_a n_d$
with path constraint $(n_c, n_d, b_p, e)$ (Figure 4(b)); if
$w_{n_a}[l] = 0$ and $w_{n_b}[l] = 1$, we only have the path $p_2 = n_c n_b n_d$
with tree constraint $(n_c, n_d, b_t)$ (Figure 4(c)); if
$w_{n_a}[l] = w_{n_b}[l] = 0$, all four nodes are predetermined and
we can determine their p-values directly (Figure 4(d)).
No useful constraints other than $(b_c, e)$, $(n_c, n_d, b_p, e)$,
and $(n_c, n_d, b_t)$ can be derived from this local cycle. Here
we already know that $(b_c, e)$ does not exist. If $(n_c, n_d, b_t)$
is already in $C_T$, it is the only useful tree constraint of e
and we are finished. If $(n_c, n_d, b_t)$ does not exist in $C_T$, we
cannot obtain $(n_c, n_d, b_t)$ by combining $b_c$ and $b_p$ because
$(b_c, e)$ does not exist, even if the path constraint $(n_c, n_d, b_p, e)$
is available. If this case holds for all $1 \le l \le m$, our
linear system actually provides no information to obtain
the tree constraint of $p_2$; the h-variable of each edge on
$p_2$ will eventually be assigned a free variable, or its value
will depend on other free variables. Therefore we do
nothing if $e \in E^x_l$.

Assume that $E_S$ is the set of non-tree edges in $E_X^g$ without cycle
constraint. Cycle synthesis is carried out by concatenating paths with
known path constraints or tree constraints. The extension procedure is
applied to $E_S$ as follows.

E1. For each $e \in E_S$, we check if there is an odd number, say 2t + 1,
of path constraints of e that link different connected components in G* to
form a synthetic cycle (Figure 5(a)); a constraint is said to link two
components A and B if one of its endpoints resides in A and the other
resides in B. As a special case (t = 0), we also obtain a synthetic cycle
if the two endpoints of a single path constraint reside in the same
connected component. If no such 2t + 1 path constraints are found, we
cannot synthesize a cycle of e and do nothing; otherwise we perform the
following tasks:

E1.1 assign the constraint $(b_c, e)$ to the synthetic cycle, where

$$b_c = \sum_{(n_i, n_j, b_p, e) \in S_e} \left( W[n_i] + W[n_j] + b_p \right) \qquad (4)$$

in which $S_e$ is the set of the chosen 2t + 1 path constraints;

E1.2 for each path constraint $(n_x, n_y, b'_p, e)$ in $C_P$, generate a
tree constraint $(n_x, n_y, b_c + b'_p)$ and add the new constraint into G*;

E1.3 update G*;

E1.4 remove e from $E_S$;

E2. If $E_S$ becomes empty (a synthetic cycle has been found for every e in
the original $E_S$), or no cycle is synthesized ($E_S$ stays unchanged), we
stop the extension procedure; otherwise we go back to E1 to start the next
iteration.
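
Step E1 can be sketched as a search for an odd closed walk in a graph whose nodes are the connected components of G* and whose edges are the path constraints of e. The Python sketch below is one possible realization under assumed data structures (dictionaries `W` and `comp` mapping a node to its W-value and to its component label); the function name and the sample data are hypothetical. It 2-colours the component graph by breadth-first traversal while accumulating the summands of Equation (4); an edge that joins two components of the same colour closes a walk made of an odd number of path constraints.

```python
from collections import defaultdict, deque

def synthesize_cycle(path_constraints, W, comp):
    """Return b_c for a non-tree edge e, or None if no synthetic cycle exists.

    path_constraints: list of (n_i, n_j, b_p) path constraints of e.
    """
    adj = defaultdict(list)
    for n_i, n_j, b_p in path_constraints:
        value = W[n_i] ^ W[n_j] ^ b_p          # one summand of Equation (4)
        a, b = comp[n_i], comp[n_j]
        if a == b:                             # special case t = 0
            return value
        adj[a].append((b, value))
        adj[b].append((a, value))
    seen = {}  # component -> (depth parity, XOR of summands from the root)
    for root in list(adj):
        if root in seen:
            continue
        seen[root] = (0, 0)
        queue = deque([root])
        while queue:
            a = queue.popleft()
            parity, acc = seen[a]
            for b, value in adj[a]:
                if b not in seen:
                    seen[b] = (parity ^ 1, acc ^ value)
                    queue.append(b)
                elif seen[b][0] == parity:
                    # root -> a, a -> b, b -> root uses an odd number of
                    # constraints; constraints used twice cancel over GF(2).
                    return acc ^ value ^ seen[b][1]
    return None

# Hypothetical components A, B, C with two nodes each.
W = {"a1": 0, "a2": 1, "b1": 0, "b2": 1, "c1": 0, "c2": 0}
comp = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}
print(synthesize_cycle([("a1", "b1", 1), ("b2", "c1", 0), ("c2", "a2", 0)], W, comp))
```

Two path constraints that link the same pair of components form a closed walk of even length, in which the h-variable of e cancels, so the sketch returns `None` in that situation.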

We thus try to synthesize a cycle for each non-tree edge in $E_S$ to
generate new tree constraints and update G*. To update G* when several
connected components are merged into a new one by new tree constraints, we
arbitrarily choose one of the old seeds of these components as the new seed
and perform a graph traversal to update the W-values within the new
connected component. A non-tree edge that fails to receive a synthetic
cycle in one trial of cycle synthesis may benefit from a later update of
G*; our extension procedure is therefore iterative and terminates only when
G* cannot be updated any further. In this procedure, a non-tree edge may be
checked many times (in different iterations) to form a synthetic cycle. In
the worst case, only one cycle is synthesized in each iteration, so we
require k iterations to perform $k + (k - 1) + \dots + 1 = O(k^2)$ trials
of cycle synthesis.

Figure 4 All possible appearances of a local cycle in a locus graph. The
dotted line represents the non-tree edge e. (a) The local cycle appears
with cycle constraint $(b_c, e)$. (b) There is only one path containing e,
with path constraint $(n_c, n_d, b_p, e)$. (c) There is only one path, with
tree constraint $(n_c, n_d, b_t)$. (d) There are only four predetermined
nodes without any constraint.

Figure 5 The concept of a synthetic cycle. (a) Five path constraints
$b_1, b_2, \dots, b_5$ link five connected components A, B, C, D, and E to
form a synthetic cycle in a constraint graph. (b) The conceptual view of
the synthetic cycle of (a) in a pedigree graph. The synthetic cycle is
actually a round trip through the tree edges and the non-tree edge e. The
trip is composed of 10 different but connected paths in the pedigree graph.
In this example, e would be visited five times during the trip.

To verify the correctness of the extension procedure, we first need to
explain the meaning of Equation (4). Following an argument similar to that
of Lemma 1, for two nodes $n_x$ and $n_y$ that reside in the same connected
component of G*, $W[n_x] + W[n_y]$ is the tree constraint of the path from
$n_x$ to $n_y$ on T(G). The synthetic cycle is conceptually a round trip
through tree edges and the non-tree edge e. The value $b_c$ in Equation (4)
is therefore the summation of h-variables along the round trip
(Figure 5(b)). Now we demonstrate that $b_c$ is the same as the cycle
constraint of e. We first show that there is exactly one h-variable of e in
$b_c$. According to Equation (4), we have 2t + 1 h-variables of e in $b_c$.
Since we perform additions over GF(2), 2t out of the 2t + 1 h-variables
cancel, and only one h-variable of e remains in $b_c$. To verify that $b_c$
is the same as the cycle constraint of e in G, we assume that the expected
cycle constraint of e is $\hat{b}_c$. We generate a set $S'_e$ by converting
the path constraints $(n_i, n_j, b_p, e)$ in $S_e$ to tree constraints
$(n_i, n_j, \hat{b}_c + b_p)$. It is easy to see that the converted 2t + 1
tree constraints also link connected components in G* to form a new
synthetic cycle, and that the corresponding round trip only contains tree
edges in T(G). T(G) has no cycle and therefore each edge of this new round
trip must be visited an even number of times, which means that its
h-variable is cancelled in the new cycle constraint. So the constraint of
the new synthetic cycle must be zero and we have the following equation:

$$\sum_{(n_i, n_j, \hat{b}_c + b_p) \in S'_e} \left( W[n_i] + W[n_j] + \hat{b}_c + b_p \right) = b_c + \sum_{i=1}^{|S'_e|} \hat{b}_c = 0.$$

Since there are 2t + 1 constraints in $S'_e$, we have
$\sum_{i=1}^{|S'_e|} \hat{b}_c = \hat{b}_c$. We then obtain
$b_c + \hat{b}_c = 0$ and conclude that $b_c = \hat{b}_c$.

For each $e \in E_S$, determining whether an odd number of path constraints
link connected components in G* to form a cycle takes O(m) time. This bound
is achieved by regarding each connected component as a single node and each
path constraint of e as a single edge, and following O(m) edges in a
depth-first traversal. Since there are $O(k^2)$ cycle syntheses throughout
the extension procedure, we require $O(k^2 m)$ time to find synthetic
cycles. Once we have synthesized a cycle for e, we require O(m) time to
convert path constraints to tree constraints because there are at most m
path constraints of e in $C_P$. There are O(k) non-tree edges in $E_S$ and
therefore the extension procedure takes O(km) time to perform constraint
conversion. To update G*, we require O(n) time to perform a breadth-first
traversal on every connected component to modify W-values, similar to the
way we initialize G*. There are at most k synthetic cycles and therefore G*
is updated O(k) times in O(kn) time. In summary, Step 3, constraint
reduction and transformation, takes
$O(k^2 m) + O(km) + O(kn) = O(k^2 m + kn)$ time.

Step 4: haplotype determination

To solve the h-values of the tree edges of T(G) by Lemma 2, we try to make
the G* produced by Step 3 connected. Firstly, we pay attention to the
founders in the pedigree. Founders cannot be predetermined endpoints of
paths with either path constraints or tree constraints and therefore
founders must be isolated nodes in G*. It is also impossible to know
whether an allele of a founder is paternal or maternal. We attach a founder
$n_f$ to G* by assuming that it passed its paternal haplotype to an
arbitrary child $n_c$. The attachment can be done by assigning weight zero
to the edge $(n_f, n_c)$ of G*, which implies $h_{n_f,n_c} = 0$ ($n_f$
passes its paternal haplotype). There are O(n) edges in G* and therefore
the attachment of founders to G* takes O(n) time.

Secondly, we check if there is any non-tree edge that can link any two
connected components of G*. A non-tree edge $e = (n_i, n_j)$ can link two
connected components A and B if we can find a path constraint
$(n_k, n_l, b_p, e)$ of a path p that, without loss of generality,
satisfies the following two conditions:

1. $n_k$ and $n_i$ reside in A and have available $W[n_k]$ and $W[n_i]$
derived from the seed $n_A$ of A,

2. $n_l$ and $n_j$ reside in B and have available $W[n_j]$ and $W[n_l]$
derived from the seed $n_B$ of B.

If we can find such a non-tree edge e, we can decompose p into three
parts: the sub-path from $n_k$ to $n_i$, the non-tree edge e, and the
sub-path from $n_j$ to $n_l$. The constraints of these three parts are
$W[n_k] + W[n_i]$, $h_{n_i,n_j}$, and $W[n_j] + W[n_l]$, respectively. It
follows that $b_p = W[n_k] + W[n_i] + h_{n_i,n_j} + W[n_j] + W[n_l]$. The
non-tree edge e can therefore be used to link components A and B with the
known h-value $h_{n_i,n_j} = b_p + W[n_k] + W[n_i] + W[n_j] + W[n_l]$. Since
there are at most O(mn) path constraints to be checked, this procedure
requires O(mn) time.
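
The linking rule above amounts to a single XOR once the two conditions have been checked. The following Python sketch is a minimal illustration; the function name and the `W`/`comp` dictionaries (node to W-value and node to component label) are assumptions made for this example. Both orientations of e are tried to cover the "without loss of generality" clause.

```python
def link_value(path_constraint, edge, W, comp):
    """Return h for the non-tree edge if the constraint links two components.

    path_constraint: (n_k, n_l, b_p) for a path that contains the edge.
    edge: the non-tree edge (n_i, n_j).
    """
    n_k, n_l, b_p = path_constraint
    n_i, n_j = edge
    for a, b in ((n_i, n_j), (n_j, n_i)):
        same_side = comp[n_k] == comp[a] and comp[n_l] == comp[b]
        if same_side and comp[a] != comp[b]:
            # h = b_p + W[n_k] + W[n_i] + W[n_j] + W[n_l] over GF(2)
            return b_p ^ W[n_k] ^ W[a] ^ W[b] ^ W[n_l]
    return None

# Hypothetical W-values: k and i lie in component A, j and l in component B.
W = {"k": 1, "i": 0, "j": 1, "l": 1}
comp = {"k": "A", "i": "A", "j": "B", "l": "B"}
print(link_value(("k", "l", 0), ("i", "j"), W, comp))
```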

Finally, assume that there remain t connected components of G*. We
arbitrarily introduce (t - 1) edges into G* to make it connected. Our
algorithm does not impose any constraint on these (t - 1) edges and
therefore the weights of these edges can be safely set as free variables.
We then update all W-values within the new connected G* (new W-values may
contain free variables), and apply Lemma 2 to determine the h-values of all
edges in T(G). With these solved h-values as well as the w-constants and
d-constants, we can determine the p-values of all nodes in the locus graphs
by Equation (1).
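
One way to carry free variables through this final update is to store each W-value as an affine expression over GF(2): a constant bit together with the set of free variables that appear in it. The Python sketch below illustrates the idea under that assumed representation; the names `add`, `connect_with_free_variables`, and the variable labels `x1`, `x2`, ... are hypothetical. Every component after the first is attached through one new edge weighted by a fresh free variable, which matches introducing (t - 1) unconstrained edges.

```python
from collections import defaultdict, deque

ZERO = (0, frozenset())  # the constant 0 with no free variables

def add(x, y):
    """Add two GF(2) expressions given as (constant, set of free variables)."""
    return (x[0] ^ y[0], x[1] ^ y[1])

def connect_with_free_variables(nodes, weighted_edges):
    """Assign W-expressions after joining all components of G*.

    weighted_edges holds (u, v, b) with b in {0, 1}. The seed of every
    component other than the first starts from a fresh free variable,
    i.e. the weight of the new edge that attaches it.
    """
    adj = defaultdict(list)
    for u, v, b in weighted_edges:
        adj[u].append((v, (b, frozenset())))
        adj[v].append((u, (b, frozenset())))
    W, free_count = {}, 0
    for seed in nodes:
        if seed in W:
            continue
        if W:  # not the first component: attach it with a free variable
            free_count += 1
            W[seed] = (0, frozenset({f"x{free_count}"}))
        else:
            W[seed] = ZERO
        queue = deque([seed])
        while queue:
            u = queue.popleft()
            for v, w in adj[u]:
                if v not in W:
                    W[v] = add(W[u], w)
                    queue.append(v)
    return W

# Hypothetical G* with components {u, v} and {w}.
W = connect_with_free_variables(["u", "v", "w"], [("u", "v", 1)])
print(add(W["u"], W["w"]))  # value of the edge that attaches w
```

Because the sum of two expressions is the XOR of the constants and the symmetric difference of the variable sets, a free variable that occurs twice along a path cancels automatically.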

The update of G* takes O(n) time. Moreover, we require O(n) time to
determine all h-values of edges in T(G). For a locus graph, the determined
h-values are used to solve all p-values in O(n) time. Since there are m
locus graphs, we require O(mn) time to determine the p-vectors of all nodes
in G. Consequently, the three procedures of this step take
O(n) + O(mn) + O(n) + O(mn) = O(mn) time.

Results and discussion 

An execution example 

We use the pedigree given in Figure 6(a) as an example 
to demonstrate how the proposed algorithm works. 
There are 19 individuals in the pedigree; eight of them 
are founders. Each individual is equipped with genotype 
data collected from four marker loci. There is a mating 
loop in the pedigree. 

In the first step, we transform the input pedigree into a pedigree graph G
and construct a spanning tree T(G) on G (Figure 6(b)). There are three
local cycles A-H-B-I-A, D-K-E-L-D, and P-R-Q-S-P and one global cycle
B-F-C-G-D-L-O-Q-N-I-B in G. Edges A-H, E-L, Q-R, and B-F are chosen as
non-tree edges within the four cycles. From the pedigree graph G, we
construct the four locus graphs and forests that are depicted in
Figure 6(c). The p-values of predetermined nodes, w-constants of all nodes,
and d-constants of all edges within the four locus graphs are also
identified.

In the second step, we generate all cycle, path, and tree constraints for
each of the four locus graphs using Equations (2) and (3). For example,
cycle A-H-B-I-A in the second locus graph has the cycle constraint

$$h_{A,H} + h_{H,B} + h_{B,I} + h_{I,A} = d_{A,H}[2] + d_{H,B}[2] + d_{B,I}[2] + d_{I,A}[2] = 0 + 1 + 1 + 0 = 0,$$

and path G-C-F-B-I-N-Q in the third locus graph has the path constraint of
the non-tree edge B-F

$$h_{G,C} + h_{C,F} + h_{F,B} + h_{B,I} + h_{I,N} + h_{N,Q} = p_G[3] + d_{G,C}[3] + d_{C,F}[3] + d_{F,B}[3] + d_{B,I}[3] + d_{I,N}[3] + d_{N,Q}[3] + p_Q[3] = 0 + 0 + 0 + 1 + 1 + 0 + 0 + 0 = 0.$$
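
All additions in these constraints are over GF(2), which is why a sum containing two ones evaluates to zero. As a quick check, the short Python sketch below (the helper name `gf2_sum` is hypothetical) reduces the constant terms quoted in the two examples with XOR.

```python
from functools import reduce
from operator import xor

def gf2_sum(bits):
    """Sum a sequence of 0/1 constants over GF(2), i.e. XOR them together."""
    return reduce(xor, bits, 0)

# Constants of cycle A-H-B-I-A in the second locus graph.
print(gf2_sum([0, 1, 1, 0]))
# Constants of path G-C-F-B-I-N-Q in the third locus graph.
print(gf2_sum([0, 0, 0, 1, 1, 0, 0, 0]))
```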

At the end of this step we obtain
$C_C = \{(0, e_{E-L}), (0, e_{Q-R}), (0, e_{A-H})\}$,
$C_P = \{(I, H, 0, e_{A-H}), (N, G, 0, e_{B-F}), (G, Q, 0, e_{B-F}), (R, Q, 0, e_{Q-R}), (N, Q, 0, e_{B-F})\}$,
and $C_T = \{(I, N, 0), (G, K, 0), (G, L, 0), (G, O, 0), (Q, S, 0)\}$.

In the third step, we obtain two new tree constraints (R, Q, 0) and
(I, H, 0) by $(0, e_{Q-R}) + (R, Q, 0, e_{Q-R})$ and
$(0, e_{A-H}) + (I, H, 0, e_{A-H})$, respectively. The set $C_T$ is
therefore extended to {(I, N, 0), (G, K, 0), (G, L, 0), (G, O, 0),
(Q, S, 0), (R, Q, 0), (I, H, 0)}. We construct the initial constraint graph
G* based on the updated $C_T$ (Figure 6(d)). In the initial G*, we choose
H, G, and Q as component seeds to determine W-values. We further find that
the three path constraints $(N, G, 0, e_{B-F})$, $(G, Q, 0, e_{B-F})$, and
$(N, Q, 0, e_{B-F})$ link three connected components to form a synthetic
cycle of the non-tree edge B-F with constraint zero. So we obtain three
extra tree constraints (N, G, 0), (G, Q, 0), and (N, Q, 0) derived from the
synthetic cycle and add them to G*.

Figure 6 An execution example. (a) A pedigree of 19 individuals with
genotype data. (b) The corresponding pedigree graph G with a spanning tree
T(G). Tree edges are solid lines and non-tree edges are dotted lines.
(c) Locus graphs and forests. Nodes with thick borders are predetermined.
(d) The corresponding constraint graph G*. Nodes and solid lines compose
the initial constraint graph. The three dotted lines are path constraints
that form a synthetic cycle of the non-tree edge B-F. (e) The final G*. All
edges except B-F have weight zero; $h_{B,F}$ is a free variable.

In the final step, we try to make G* connected to solve all h-values. We
first arbitrarily introduce eight edges A-H, B-I, C-G, D-G, E-L, J-N, M-O,
and P-R to attach the eight founders to G*; all eight edges have weight
zero, implying that each founder passes its paternal haplotype to one of
its children. Now there are only two connected components in G*, one of
which is an isolated node, F. We attach F to G* by setting $h_{B,F}$ as a
free variable. This final connected G* is depicted in Figure 6(e). After
the final update of G*, all h-values other than $h_{B,F}$ are zero, and
$h_{B,F}$ is free to be either zero or one. Given these known h-values, all
p-values over the four locus graphs can be solved by Equation (1).

Time complexity and experimental result 

According to the analyses at the end of each step in Section 3, the time
complexity of our algorithm is O(mn) (step 1: preprocessing) + O(kmn)
(step 2: constraint generation) + $O(k^2 m + kn)$ (step 3: constraint
reduction and transformation) + O(mn) (step 4: haplotype determination) =
$O(kmn + k^2 m)$. Because k is regarded as a constant, the running time is
O(mn), that is, linear in the size of the input.

To verify the efficiency and the correctness of our algorithm, we conducted
experiments using the proposed method. Our algorithm was implemented in C
and was evaluated on a desktop computer equipped with an Intel Core i7-2600
3.4 GHz CPU and 8 GB of RAM. The desktop ran the Ubuntu 11.10 operating
system with Linux kernel 3.0.0-16-generic and the GNOME 3.2.1 graphical
user interface.

In the experiments, we generated test cases by setting different numbers of
individuals (n) and markers (m). We applied the algorithm developed by
Thomas et al. [24] to generate 12 tree pedigrees with different n values
ranging from 30 to 400. To observe how the number of mating loops (k)
affects our algorithm, each tree pedigree was preprocessed to produce four
variants with zero, two, four, and six mating loops. For each pedigree, we
examined 10 different m values ranging from 10 to 300. Each (n, m)
combination was tested 100 times. Each time we generated new genotypes and
randomly selected one pedigree from the four variants of the given n. The
haplotype configurations of all the 12000 trials were correctly identified.
The experimental results are listed in Table 2.



Table 2(a) shows that the unknown h-variables were correctly solved without
assigning any free variable if the number of marker loci was at least 30,
which covers most practical cases in regular genotyping. Free variables
were required only when the number of marker loci was far less than the
number of individuals. In this experiment, free variables were used only
when m = 10, and they were used at most five times out of 100 trials. The
result is reasonable because the dimension of the solution space of a
pedigree with a limited number of marker loci is probably less than the
number of unknown h-variables.

Table 2(b) shows the cumulative execution time of 100 trials of each (n, m)
combination. The execution time fluctuated when n or m was less than 100.
We conjecture that, because the algorithm executes very fast for small
values of n or m, the cumulative execution time might be dramatically
affected by context switches within the operating system, which ran many
background services. Furthermore, we believe that when both n and m were
larger than 100, the execution time of the algorithm became more
significant than that of the context switches. From the table it is
apparent that the execution time is linear for the larger n and m values.

Finally, Table 2(c) shows that mating loops were evenly distributed
throughout all 12000 trials, with the number ranging from zero to six per
pedigree, and that this number did not affect the linearity of the
execution time of our algorithm in relation to the input size given by n
and m.

Issue of spanning tree and seed node selection 

In the first step, preprocessing, a spanning tree T(G) is constructed on
the pedigree graph G. As mentioned above, T(G) is constructed for the ease
of cycle processing; it is merely an auxiliary data structure used to
generate linear constraints of all cycles and paths between predetermined
nodes in G. We do not impose any constraint on the construction of T(G)
because predetermined nodes are defined by genotype data. Once the input
pedigree is given, all the cycles and paths as well as their constraints
are fixed, no matter which spanning tree is constructed on the pedigree
graph. Different spanning trees assign different edges as the non-tree edge
in a cycle, and only affect the type of a constraint; a constraint may be a
path constraint with respect to one spanning tree and a tree constraint
with respect to another spanning tree. Since different spanning trees
generate the same set of constraints, regardless of their type, the
construction of the spanning tree can be arbitrary. In our implementation,
T(G) was constructed by depth-first traversal.

In the second step, constraint generation, a seed node is arbitrarily
selected from T(G) to generate tree constraints. To see why the seed node
can be selected arbitrarily, assume that there are two possible seeds $n_i$
and $n_j$. For any other predetermined node $n_k$, we have
$(n_j, n_k, b_{jk}) = (n_i, n_j, b_{ij}) + (n_i, n_k, b_{ik})$ by Lemma 1,
which means that a tree constraint seeded with one predetermined node is a
linear combination of two tree constraints seeded with another
predetermined node. Hence, tree constraints seeded with different
predetermined nodes are mathematically equivalent; we can safely choose any
predetermined node as the seed. Similarly, the seed nodes within a
constraint graph can also be selected arbitrarily based on the above
argument.

Table 2 Experimental results. (a) Average number of free variables.
(b) Execution time (seconds) to generate solutions. Each entry in the table
is the cumulative execution time of 100 replicates. (c) Average number of
mating loops.
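
The seed-independence argument can be illustrated with a short Python sketch. It assumes that the W-values of one connected component are stored in a dictionary and re-expresses them relative to another seed; the function name `reseed` and the sample values are hypothetical. The constraint between any two nodes, W[a] + W[b] over GF(2), is unchanged by the re-seeding.

```python
from itertools import combinations

def reseed(W, new_seed):
    """Re-express the W-values of one component relative to another seed."""
    offset = W[new_seed]
    return {node: w ^ offset for node, w in W.items()}

# Hypothetical W-values seeded at node "i".
W_i = {"i": 0, "j": 1, "k": 1}
W_j = reseed(W_i, "j")

# Every pairwise constraint is identical under both seeds.
for a, b in combinations(W_i, 2):
    assert W_i[a] ^ W_i[b] == W_j[a] ^ W_j[b]
print(W_j)
```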

Consistency checking 

Although we assume that the input pedigree is free of genotyping errors,
our algorithm can be easily modified to detect inconsistencies within the
genotype data without loss of efficiency. No recombination is allowed in
the input pedigree and therefore inconsistencies arise if there are
different assignments of an h-value that result in incompatible linear
constraints. We may designate the following two checkpoints to detect
inconsistencies within our linear system:

1. The generation of constraints. The constraint of a path or a cycle may
be computed more than once across all locus graphs; all these computations
should arrive at the same value. So each time we compute a constraint, we
check if it is the same as the current value, if any.

2. The initialization/update of G*. There may be loops in the constraint
graph G* and therefore there may be more than one path from the seed $n_s$
to a node $n_i$. As a result, $W[n_i]$ may be assigned more than once in
the initialization or update procedures of G*. By the definition of
W-variables, however, all the assignments to $W[n_i]$ are actually
associated with the same path from $n_s$ to $n_i$ on T(G) and therefore
should be identical. So each time we compute a W-value, we check if it
agrees with the current value, if any.
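
The second checkpoint can be sketched as a small change to the traversal that assigns W-values: when an edge of G* reaches a node that already has a W-value, the newly computed value is compared with the stored one. The Python code below is a minimal illustration under assumed data structures (tree constraints as `(u, v, b)` triples); the function name and the use of `ValueError` to report an inconsistency are choices made for this sketch only.

```python
from collections import defaultdict, deque

def assign_w_checked(nodes, tree_constraints):
    """Assign W-values by BFS and raise ValueError on conflicting values."""
    adj = defaultdict(list)
    for u, v, b in tree_constraints:
        adj[u].append((v, b))
        adj[v].append((u, b))
    W = {}
    for seed in nodes:
        if seed in W:
            continue
        W[seed] = 0
        queue = deque([seed])
        while queue:
            u = queue.popleft()
            for v, b in adj[u]:
                expected = W[u] ^ b
                if v not in W:
                    W[v] = expected
                    queue.append(v)
                elif W[v] != expected:
                    # Two routes in G* disagree on W[v]: inconsistent data.
                    raise ValueError(f"inconsistent W-value at node {v!r}")
    return W

# A consistent loop in G*: the three weights sum to zero over GF(2).
print(assign_w_checked(["a", "b", "c"], [("a", "b", 1), ("b", "c", 1), ("a", "c", 0)]))
```

The extra comparison is a constant-time operation per edge, so the check does not change the linear cost of the traversal.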

Conclusions 

In this study, we proposed and implemented an algorithm to solve the
zero-recombinant haplotype configuration (ZRHC) problem for a general
pedigree in $O(kmn + k^2 m)$ time. With the aid of free variables, our
method provides a general solution that describes the possible haplotype
structures within a pedigree, rather than a particular solution that only
assigns a specific numerical setting to haplotypes. To the best of our
knowledge, this algorithm is the first deterministic one to provide a
general solution in linear time for pedigrees having a small number of
mating loops. Moreover, the algorithm can be easily modified to detect
inconsistency among genotype data without loss of efficiency. Our
experimental results confirm its linearity. In the future, we will try to
extend the proposed algorithm to handle recombination and missing data in
linear time for general pedigrees.